Abstract

Over the last 30 years the development of RNA secondary structure prediction
algorithms has been guided and inspired by corresponding combinatorial studies
in which RNA molecules are modeled as certain kinds of planar graphs.
Conversely, new algorithmic ideas gave rise to interesting combinatorial
problems asking for a deeper understanding of the structures processed. One
such example is the notion of the order of a secondary structure, introduced by
Waterman (as a parameter on graphs) in 1978, which reflects a structure's
overall complexity: regarding so-called hairpin-loops as the building blocks of
a secondary structure, the order provides information on the (balanced)
nesting-depth of hairpin-loops and thus on the overall complexity of the
structure. In related prediction algorithms, one first searches for order 1
structures and then increases the allowed order step by step, thus considering
a higher structural complexity in every iteration.

Subsequently, Zuker et al. and Clote introduced a more realistic combinatorial
model for RNA secondary structures, the so-called saturated secondary
structures. Compared to the traditional model of Waterman, unpaired nucleotides
(vertices) in a favorable position for a pairing do not exist, i.e. no base
pair (edge) can be added without violating at least one restriction for the
graphs. That way, one major shortcoming of the traditional model has been
removed. However, the resulting model is much more challenging from a
mathematical point of view. As a consequence, so far only little is known about
the combinatorics of saturated RNA structures.

In this paper we show how saturated structures can be attacked and, in
particular, how to analyze their order. This is of special interest since in
the past the order has proven to be one of the most demanding parameters to
address (for the traditional model it was an open problem for more than 20
years to find asymptotic results for the number of structures of given order
and similar). We show that the expected order of saturated RNA secondary
structures of size $n$ is $\log_4 n \left(1 + O\left(\frac{\log_2 n}{n}\right)\right)$
if we select the saturated secondary structure uniformly at random.
Furthermore, the order of saturated secondary structures is sharply
concentrated around its mean. As a consequence, saturated structures and
structures in the traditional model behave the same with respect to the
expected order. Thus we may conclude that the traditional model has already
drawn the right picture, and conclusions inferred from it with respect to the
order (the overall shape) of a structure remain valid even if enforcing
saturation (at least in expectation).
1. Introduction

The building blocks of RNA are four different nucleotides $\{a, c, g, u\} = \Sigma$,
which are linked to each other in a linear fashion. Accordingly, the so-called
primary structure of RNA, i.e. the linear sequence of building blocks, is
modeled as a string over $\Sigma$. In addition, non-neighboring nucleotides have
a second means of binding by which certain combinations of nucleotides (a-u,
c-g and g-u) may form pairs, i.e. stick to each other. This gives rise to a 3D
folding of the molecule which in many cases determines its biological function.
Each such pair reduces the so-called free energy of the molecule and the
conformation of minimal free energy is adopted in nature. Today, lab techniques
to determine the primary structure of RNA are cheap and efficient while
determining the 3D structure is still a time-consuming and expensive task.
Accordingly one aims for algorithms to predict the structure from the sequence.
However, even when building on rather simple models for the free energy, its
minimization becomes an NP-complete problem when allowing arbitrary foldings
[10]. As a consequence, the set of considered structures is constrained and
so-called secondary structures are considered as the first step towards
understanding the biological function of RNA. There, only non-crossing pairings
of nucleotides are allowed such that - ignoring types of nucleotides - the
molecule can be represented as a planar graph [16] (see Figure 1.1) or
alternatively by strings over {.,(,)} where a . represents an unpaired
nucleotide and a pair of corresponding brackets represents two paired
nucleotides (the left structure of Figure 1.1 is in correspondence with

((..((( )))■)))• Even if computing a structure of minimum free energy (mfe) 

becomes efficient for secondary structures (algorithms with cubic time bounds
are well known), the empirical thermodynamic data used are incomplete and
erroneous such that suboptimal solutions need to be taken into consideration
[11]. Computing the suboptimal structures is not difficult; however, the number
of potentially interesting suboptimal conformations grows exponentially with
the length of the nucleotide sequence. As one possible solution, Zuker and
Sankoff suggested restricting secondary structure folding to structures whose
stacking regions (runs of consecutive brackets) extend maximally in both
directions. This led to the definition of saturated structures, for which no
base pair can be added without violating the restrictions for secondary
structures, see Figure 1.1. Extending the runs of consecutive brackets removes
one major shortcoming of the traditional model, i.e. of secondary structures,
which compared to native molecules tend to have far too short stacking regions.
Furthermore, in light of the asymptotic number of saturated structures
determined by Clote et al. [1], the run time of RNA prediction algorithms
should be substantially reduced if the search for suboptimal foldings is
limited to saturated structures only, as observed by Bompfünewerer et al. for
so-called canonical structures [1].
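The saturation condition described above can be checked mechanically on a dot-bracket string. The following is a minimal sketch, not taken from the paper: the function names `pair_table` and `is_saturated` are hypothetical, and we assume the usual restriction that two paired positions must enclose at least one nucleotide (so neighbors cannot pair).

```python
def pair_table(s):
    """Return partner[i] for every position of a dot-bracket string (None if unpaired)."""
    stack, partner = [], [None] * len(s)
    for i, c in enumerate(s):
        if c == "(":
            stack.append(i)
        elif c == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            j = stack.pop()
            if i - j < 2:
                raise ValueError("neighboring positions cannot be paired")
            partner[i], partner[j] = j, i
        elif c != ".":
            raise ValueError("unexpected symbol %r" % c)
    if stack:
        raise ValueError("unbalanced opening bracket")
    return partner


def is_saturated(s):
    """True if no further base pair can be added without crossing or pairing neighbors."""
    partner = pair_table(s)
    n = len(s)
    for i in range(n):
        if partner[i] is not None:
            continue
        depth = 0
        for j in range(i + 1, n):
            if s[j] == "(":
                depth += 1
            elif s[j] == ")":
                depth -= 1
                if depth < 0:      # we left the loop that contains position i
                    break
            elif depth == 0 and j - i >= 2:
                return False       # (i, j) could be added without crossing
    return True
```

For instance, `is_saturated(".(.).")` is false because the two outer positions could still be paired, while `is_saturated("..(.)")` is true.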

Figure 1.1. Secondary structure (left) and its saturated counterpart
(right) where three additional links have been added (highlighted in
red). The primary structure is given by the chain of vertices along the
solid line; pairs of nucleotides are represented by dotted edges. Note that
3 and 4 cannot be paired since they are neighbors with respect to the
primary structure.

Clote initiated the combinatorial study of saturated structures [3], which is
much more challenging than that of secondary structures from a mathematical
point of view. He estimated the number of saturated structures by applying
implicit function theory to the functional equation of its generating function
$S(z)$, i.e.,

$$-S(z)^3 z^4 - S(z)^2 z^2 (-2 + z^2) + S(z)(-1 + z^2) + z(1 + z) = 0,$$

whereas the functional equation for secondary structures is relatively simple
and given by

$$T(z) = z + zT(z) + z^2 T(z) + z^2 T(z)^2,$$

for $T(z)$ the generating function of secondary structures. Of course we observe
variations of local parameters of the structures like the length and number
of stacking regions or the length and number of loops (runs of symbols .).
However, it is not at all obvious whether saturation has an effect on the
overall shape of the structures. One parameter which allows us to measure their
overall shape is the so-called order, originally introduced by Waterman in 1978
for algorithmic purposes. A secondary structure $s$ (saturated or not,
represented in dot-bracket form) has order $p$ if we need exactly $p$
iterations of deleting all maximal substrings $(^i)^i$ within $\varphi(s)$ in
parallel to obtain the empty string $\varepsilon$. Here $\varphi$ is the
homomorphism implied by $\varphi(() = ($, $\varphi()) = )$ and
$\varphi(.) = \varepsilon$. Accordingly, the order provides information on the
(balanced) nesting-depth of so-called hairpin-loops (substrings with
$\varphi$-image $(^i)^i$, which e.g. holds for the structures depicted in
Figure 1.1) and thus on the overall complexity of the structure (it was used by
algorithms to consider increasingly complex foldings, starting with a search
space restricted to structures of order 1).
In this paper we show one way to approach the combinatorics of saturated
structures and especially how to analyze their order. This - besides the
motivating remarks from above - is of special interest since in the past the
order has proven to be one of the most demanding parameters to address (for
secondary structures it was an open problem for more than 20 years to find
asymptotic results for the number of structures of given order and similar).
For that purpose we discuss the generating function of saturated structures
having order $\ge p$, denoted by $S_p(z)$, from which we extract the expected
order of a saturated structure of given size. We find that in expectation the
order behaves the same for secondary and saturated structures, such that we may
conclude that the traditional model (secondary structures) has already drawn
the right picture, and conclusions inferred from it with respect to the order
(the overall shape) of a structure remain valid even if enforcing saturation
(at least in expectation).

The paper is organized as follows. We first present our main results.
Afterwards we describe a streamlined analysis with details delayed till the
last sections (or the appendix).

2. Main Results 

Let $S(n)$ be the number of saturated RNA secondary structures of size $n$
and $S_p(n)$ be the number of saturated RNA secondary structures of size $n$
having order $\ge p$. We let $\xi_n$ be the random variable with probability
distribution

$$P(\xi_n = p) = \frac{S_p(n) - S_{p+1}(n)}{S(n)},$$

i.e. we select each saturated structure uniformly at random among the
family of saturated RNA secondary structures of size $n$. Our main results are
summarized as follows.

Theorem 2.1. The expected order of a saturated RNA secondary structure of
size $n$ is

$$E\xi_n = \log_4 n \cdot \left(1 + O\left(\frac{\log_2 n}{n}\right)\right).$$

Theorem 2.1 indicates that although the saturation of secondary structures
increases the expected number of paired bases (and therefore increases the
number of hairpin-loops possible) and scales down the search space, the
complexity of the folding algorithm for saturated structures as given by the
order stays almost the same. We may conclude that the traditional secondary
structure model has already drawn the right picture, and conclusions inferred
from it with respect to the order of a structure (its overall shape) remain
valid even if enforcing saturation (at least in expectation).

Theorem 2.2 below shows that $\xi_n$ is highly concentrated around the expected
order $E\xi_n$.

Theorem 2.2. Assume we choose < x < (| — /3)log4n for arbitrary /3 > 0, 
then we have 



3. Road Map of the Proof 

In this section, we shall address the major steps and difficulties of analyzing
the expected order of saturated structures by tools from analytic combinatorics 
E] . We start by deriving the key recursions for saturated structures of order 



Let $S(z)$ (resp. $\mathcal{S}$) be the generating function (resp. the family)
of saturated RNA structures and $R(z)$ (resp. $\mathcal{R}$) be the generating
function (resp. the family) of saturated structures having the first and the
last position paired, i.e., $\mathcal{R} = (\mathcal{S})$ where the parentheses
represent the paired bases, and $R(z) = z^2 S(z)$. Furthermore, let $S_p(z)$ and
$R_p(z)$ denote the corresponding generating functions assuming order $\ge p$,
$p \ge 1$. By decomposing a saturated structure into independent
$\mathcal{R}$-type structures, we obtain the functional equation for $S(z)$

$$S(z) = \sum_{i \ge 0} \bigl(1 + (i+1)(z + z^2)\bigr) R(z)^i - 1 = \frac{R}{1-R} + \frac{z + z^2}{(1-R)^2}. \tag{3.1}$$

Now, taking the order into account (omitting the variable $z$ for ease of
notation), we find the following recurrences for $S_p$ and $R_{p+1}$, $p \ge 1$,

$$S_p = \sum_{i \ge 0} \bigl(1 + (i+1)(z + z^2)\bigr)\bigl(R^i - (R - R_p)^i\bigr) = \frac{R_p\bigl[1 + 2z^2 + 2z - 2R - 2Rz - 2Rz^2 + R^2 + (1 + z + z^2 - R)R_p\bigr]}{(R-1)^2 (R - R_p - 1)^2}, \tag{3.2}$$

$$R_{p+1} = R - z^2 \Bigl[\sum_{i \ge 1} \bigl(1 + (i+1)(z + z^2)\bigr)\bigl((R - R_p)^i + (R_p - R_{p+1})\, i\, (R - R_p)^{i-1}\bigr) + z + z^2\Bigr], \tag{3.3}$$

$$R_{p+1} = \frac{(-R - z^2) R_p^3 + (-3R + 3R^2 + 3Rz^2 - z^2) R_p^2}{-R_p^3 + (3R - 3) R_p^2 + (6R - 3 - 3R^2 + z^2) R_p + (R - 1) P_R}, \tag{3.4}$$

where $P = R^3 + (z^2 - 2)R^2 + (1 - z^2)R - z^3 - z^4$ and
$P_R = \partial P / \partial R = 3R^2 + 2(z^2 - 2)R + (1 - z^2)$, and the initial
conditions are $R_1 = R$ and $S_0 = S$.

Unlike for secondary structures (see footnote 1), due to the non-local
dependencies imposed by saturation, neither an appropriate symbolic
substitution nor a closed form solution of recurrence (3.2) is likely to exist.
We therefore have to extract the information on the expected order from the
recurrence itself rather than attempting to solve it. The proof for the
expected order of saturated structures thus consists of locating the dominant
singularities of $S_p(z)$ for $p \ge 0$, verifying the analytic continuation of
$S_p(z)$ to some $\Delta$-domain, which guarantees the validity of integration
along the Hankel contour (see Figure 3.1), and finding the singular expansion
of $S_p(z)$ within the intersection of the $\Delta$-domain and a small
neighborhood of the dominant singularity. Finally we apply a transfer theorem
to the singular expansions of $\sum_{p \ge 1} S_p(z)$ and $S(z)$ to extract
their $n$-th coefficients, and conclude the expected order $E\xi_n$ via

$$E\xi_n = \frac{[z^n] \sum_{p \ge 1} S_p(z)}{[z^n] S(z)}.$$
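For small sizes the recurrences can be evaluated exactly with truncated power series, which yields the expected order without any asymptotics. The sketch below is our own illustration, not the authors' method of proof. It uses the closed forms $H(y) = \frac{y}{1-y} + \frac{z+z^2}{(1-y)^2}$ and $H'(y) = \frac{1}{(1-y)^2} + \frac{2(z+z^2)}{(1-y)^3}$ of the sums in (3.1) to (3.3), so that $S = H(R)$, $S_p = S - H(R - R_p)$ and, solving (3.3) for $R_{p+1}$, $R_{p+1} = \frac{R - z^2 H(y) - z^2 H'(y) R_p}{1 - z^2 H'(y)}$ with $y = R - R_p$. The names `H`, `H_prime`, `order_series`, `expected_order` and the truncation degree `N` are assumptions introduced here.

```python
from fractions import Fraction

N = 24  # all power series are truncated after z^N


def series(*coeffs):
    return (list(coeffs) + [0] * (N + 1))[: N + 1]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def mul(a, b):
    out = [0] * (N + 1)
    for i, x in enumerate(a):
        if x:
            for j in range(N + 1 - i):
                out[i + j] += x * b[j]
    return out


def inv(a):
    """Reciprocal of a power series with constant term 1."""
    assert a[0] == 1
    out = [1] + [0] * N
    for k in range(1, N + 1):
        out[k] = -sum(a[j] * out[k - j] for j in range(1, k + 1))
    return out


ONE, U, Z2 = series(1), series(0, 1, 1), series(0, 0, 1)   # 1, z + z^2, z^2


def H(y):
    g = inv(sub(ONE, y))                      # 1 / (1 - y)
    return add(mul(y, g), mul(U, mul(g, g)))


def H_prime(y):
    g = inv(sub(ONE, y))
    g2 = mul(g, g)
    return add(g2, mul(add(U, U), mul(g2, g)))


def order_series():
    """Return S and the list [S_1, S_2, ...] of all S_p that are non-zero up to z^N."""
    S = series()
    for _ in range(N + 1):                    # fixed point of S = H(z^2 S), eq. (3.1)
        S = H(mul(Z2, S))
    R = mul(Z2, S)
    Rp, result = R, []                        # initial condition R_1 = R
    while True:
        y = sub(R, Rp)
        Sp = sub(S, H(y))                     # eq. (3.2) in summed form
        if not any(Sp):
            return S, result
        result.append(Sp)
        hp = mul(Z2, H_prime(y))
        numerator = sub(sub(R, mul(Z2, H(y))), mul(hp, Rp))
        Rp = mul(numerator, inv(sub(ONE, hp)))   # eq. (3.3) solved for R_{p+1}


def expected_order(n, S, Sp_list):
    """Exact E(xi_n) as the ratio of n-th coefficients."""
    return Fraction(sum(Sp[n] for Sp in Sp_list), S[n])
```

With `S, Sp = order_series()`, the call `expected_order(8, S, Sp)` returns 37/36: of the 36 saturated structures of size 8, exactly one, `((.)(.))`, has order 2 and all others have order 1.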



Footnote 1. For secondary structures the expected order has been analyzed by
making use of well-known closed form representations of the multivariate
generating function for binary trees having Horton-Strahler number p. By the
use of appropriate symbolic substitutions for the different variables, the
binary trees with Horton-Strahler number p were expanded into the secondary
structures of order p, and a closed form for the corresponding generating
function followed [12].



The results on the deviation from the expected order follow similarly.
Before we proceed, we present the Transfer Theorem by Flajolet and Odlyzko
[6]. The central point of this theorem is to use Cauchy's formula by
integrating along the Hankel contour depicted in Figure 3.1, which is justified
by the analytic continuation within a $\Delta$-domain. We set

$$\Delta_{z_0}(M, \phi) = \{z : |z| < M,\ z \ne z_0,\ |\arg(z - z_0)| > \phi\}$$

where $M > z_0$ and $0 < \phi < \frac{\pi}{2}$. Let $U_{z_0}(r, \phi)$ be the
intersection of $\Delta_{z_0}(M, \phi)$ and the neighborhood of $z_0$, i.e.,

$$U_{z_0}(r, \phi) = \{z : 0 < |z - z_0| < r,\ |\arg(z - z_0)| > \phi\},$$

then we have:

Theorem 3.1. (Transfer Theorem) Assume that $f(z)$ is analytic within
$\Delta_1(M, \phi)$, and for $z \in U_1(r, \phi)$, $f(z)$ satisfies



^1- 

Then we have $[z^n] f(z) = O(n^{-2} \cdot \log_2 n)$.
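The membership condition of the $\Delta$-domain in which $f$ must be analytic is easy to state as a predicate. The following is only an illustrative sketch: the function name `in_delta_domain` and the default values for `M` and `phi` are arbitrary choices made here, not values from the paper.

```python
import cmath
import math


def in_delta_domain(z, z0=1.0, M=2.0, phi=math.pi / 4):
    """True if |z| < M, z != z0 and |arg(z - z0)| > phi."""
    z = complex(z)
    if z == z0:
        return False
    return abs(z) < M and abs(cmath.phase(z - z0)) > phi
```

Points on the real axis to the right of the singularity, such as `1.5` for `z0 = 1`, are excluded because their argument relative to `z0` is zero, while the origin and points such as `1 + 0.5j` lie inside the indented disc.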

Theorem 3.1 assumes the dominant singularity is $z = 1$. However, the case
of a dominant singularity at $z = z_0 \ne 1$ can always be reduced to the case
where $z = 1$ is the dominant singularity according to

$$[z^n] f(z) = z_0^{-n} \cdot [z^n] f(z_0 z).$$
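This reduction is just a rescaling of the variable. As a brief check, write $f(z) = \sum_{n \ge 0} f_n z^n$; the coefficients $f_n$ and the auxiliary function $g$ are notation introduced here for illustration only.

```latex
g(z) := f(z_0 z) = \sum_{n \ge 0} f_n\, z_0^{\,n}\, z^n
\quad\Longrightarrow\quad
[z^n]\, g(z) = z_0^{\,n} f_n
\quad\Longrightarrow\quad
[z^n]\, f(z) = f_n = z_0^{-n}\, [z^n]\, f(z_0 z).
```

Since $g$ is singular at $z = 1$ exactly when $f$ is singular at $z = z_0$, the Transfer Theorem can be applied to $g$ and the factor $z_0^{-n}$ restores the exponential growth rate.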
In what follows we detail the steps that are needed for the singularity
analysis of $\sum_{p \ge 1} S_p(z)$.

Figure 3.1. $\Delta_1$-domain (yellow) and Hankel contour (green): the
Transfer Theorem applies Cauchy's formula by integrating along the
Hankel contour, colored in green. The inner incomplete circle 3, together
with the two rectilinear lines 2 and 4, mainly contribute to the
integral. Here we assume the dominant singularity is at $z = 1$.

Step 1: Locate dominant singularities: We first observe that the dominant
singularity of $S(z)$ is unique since $[z^n] S(z) \ne 0$ holds for arbitrary $n$
and therefore $S(z)$ is aperiodic [6]. Assume $z_0$ is the unique dominant
singularity of $S(z)$, then $z_0$ is also the unique dominant singularity of
$S_p(z)$ for $p \ge 0$. Indeed, consider the field extension of the rational
function field $\mathbb{Q}(z)$ induced by the algebraic functions $S_p(z)$; we
can inductively prove that $[\mathbb{Q}(S_p(z)) : \mathbb{Q}(z)] = 3$ based on
the tower relation
$[\mathbb{Q}(S_p(z)) : \mathbb{Q}(z)] = [\mathbb{Q}(S_p(z)) : \mathbb{Q}(S_{p-1}(z))]\,[\mathbb{Q}(S_{p-1}(z)) : \mathbb{Q}(z)] = [\mathbb{Q}(S_{p-1}(z)) : \mathbb{Q}(z)]$.
In other words, $S_p(z)$ is an algebraic function of degree 3 over the field
$\mathbb{Q}(z)$. Let $S_{\le p}(z)$ be the generating function of saturated
structures having order $\le p$; similarly we can prove that $S_{\le p}(z)$ is
rational and, in view of $S_p(z) = S(z) - S_{\le p-1}(z)$, we can claim that the
$S_p(z)$ ($p \ge 0$) have the same unique dominant singularity as $S(z)$.
Otherwise, suppose $z = \gamma < z_0$ is the dominant singularity of $S_p(z)$.
Then $S_p(\gamma) < S_p(z_0) < S(z_0) < \infty$, which contradicts the fact that
$S_p(\gamma) = S(\gamma) - S_{\le p-1}(\gamma) = \infty$, since
$S_{\le p-1}(z)$ is a rational function and $z = \gamma$ must be one of its
poles. Furthermore, $z = z_0$ is the unique dominant singularity of $S_p(z)$
since $[z^n] S_p(z) \ne 0$ and $S_p(z)$ is aperiodic.

Lemma 3.1. Let $z_0$ be the unique dominant singularity of $S_p(z)$ ($p \ge 0$),
then $z_0 \approx 0.424687$.

We apply the implicit function theorem to eq. (3.1) to extract the unique
dominant singularity of $S(z)$, which is also the unique dominant singularity
of $S_p(z)$.
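The numerical value in Lemma 3.1 can be reproduced by locating the branch point of the curve $P(R, z) = 0$, i.e. the point where $P = 0$ and $P_R = 0$ hold simultaneously. The sketch below is one possible way to do this and is not claimed to be the authors' computation: for each $z$ it takes the smaller root of the quadratic $P_R = 0$ and bisects on the sign of $P$ there. The function names and the search interval $[0.3, 0.5]$ are our own assumptions.

```python
import math


def P(R, z):
    return R**3 + (z**2 - 2) * R**2 + (1 - z**2) * R - z**3 - z**4


def critical_R(z):
    """Smaller root of P_R = 3R^2 + 2(z^2 - 2)R + (1 - z^2) = 0."""
    b, c = 2 * (z * z - 2), 1 - z * z
    return (-b - math.sqrt(b * b - 12 * c)) / 6


def dominant_singularity(lo=0.3, hi=0.5, tol=1e-12):
    """Bisection for the z at which P vanishes at its critical point in R."""
    f = lambda z: P(critical_R(z), z)
    assert f(lo) > 0 > f(hi)          # sign change brackets the branch point
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The returned value agrees with the constant of Lemma 3.1 to the digits given there; its reciprocal is the exponential growth rate of the number of saturated structures.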

Step 2: Establish the analytic continuation in some $\Delta_{z_0}$-domain:
Since $S_p(z)$ is an algebraic function of degree 3 over the rational function
field $\mathbb{Q}(z)$, $S_p(z)$ must be D-finite, which allows for analytic
continuation in any $\Delta_{z_0}$-domain containing zero [14].

Step 3: Singular expansion: We shall show the singular expansion of $S_p(z)$
within $U_{z_0}(\epsilon, \phi)$ for sufficiently small $\epsilon > 0$ and
$0 < \phi < \frac{\pi}{2}$. Our strategy is to transform the fractional form of
the recursion for $R_p(z)$ (eq. (3.4)) into "linear" form, based on the
contributions of individual terms to the behavior of $R_p(z)$ for different p.

Case 1: p < pu = max {p : \Pr{R- 1)| < \^ ■ Rp\} for 02 = -3R + SR"^ + 
3Rz'^ - z"^. 

Lemma 3.2. Assume that $z \in U_{z_0}(\epsilon, \phi)$ and $a_2 = -3R + 3R^2 + 3Rz^2 - z^2$, then

1 Pr{R - 1) 



Sp+i{z) = Sp{zo)2 ^ - — 

Pr{R - 1 



' 2P 



02 



+ 2P 



Pr{R-1) '' 



0-2 



holds for p < pm and Sp{zo) = s + 0(2 ^) where s > is constant. 

Case 2: $p > p_M$. We continue analyzing the recurrence relations for $S_p(z)$
and $R_p(z)$ for $p > p_M$. Let $A'_p = -\frac{R + z^2}{a_2} R_p^3$ and
$B'_p = -\frac{3 P_R}{a_2} R_p + \frac{3R - 3}{a_2} R_p^2 - \frac{1}{a_2} R_p^3$.
Note that $A'_p \to 0$ and $B'_p \to 0$ as $p \to \infty$ and
$z \in U_{z_0}(\epsilon, \phi)$. Then we simply have

$$R_{p+1}(z) = \frac{R_p^2 + A'_p}{\frac{P_R (R - 1)}{a_2} + 2R_p + B'_p}. \tag{3.5}$$
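As a consistency check added here (it is not part of the original derivation), the form (3.5) follows from the rational recurrence (3.4) by dividing numerator and denominator by $a_2 = -3R + 3R^2 + 3Rz^2 - z^2$ and using one polynomial identity between the coefficient of $R_p$ and $P_R = 3R^2 + 2(z^2 - 2)R + (1 - z^2)$:

```latex
\begin{aligned}
(6R - 3 - 3R^2 + z^2) - 2a_2
  &= 12R - 3 - 9R^2 + 3z^2 - 6Rz^2 \;=\; -3P_R, \\[4pt]
R_{p+1}
  &= \frac{a_2 R_p^2 - (R + z^2) R_p^3}
          {(R-1)P_R + (2a_2 - 3P_R) R_p + (3R - 3) R_p^2 - R_p^3} \\[4pt]
  &= \frac{R_p^2 + A'_p}{\dfrac{P_R (R-1)}{a_2} + 2R_p + B'_p},
\end{aligned}
\qquad
\begin{aligned}
A'_p &= -\frac{R + z^2}{a_2}\, R_p^3, \\
B'_p &= -\frac{3P_R}{a_2}\, R_p + \frac{3R - 3}{a_2}\, R_p^2 - \frac{1}{a_2}\, R_p^3 .
\end{aligned}
```

Every term of $A'_p$ and $B'_p$ carries either an extra power of $R_p$ or the factor $P_R$, which vanishes at the dominant singularity; this is why both are lower-order corrections to $R_p^2$ and $2R_p$ near $z_0$.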



We observe that B' and A' converge to faster than Rp as p — t- oo, and it 

only remains to determine the major contribution between Rp and ^^^ ~ ' 
from the denominator to the behavior of Rp for different p. Here we ah re- 
duce the recursions to the function h{x,fi,v) = -, % + ' from which we can 
prove /i(x,;U,z/) = /i(x,0,0) + 0(max{|^|, |i/|}) holds uniformly for x / ^ as 
max{|/x|, |z/|} — )• 0. In order to asymptotically solve eq. (j3.5p . we need to avoid 
Rp = g ^^ ~ , which may occur when p is sufficiently large. To this aim, we 



''P 2 a2 

select Ai > and A2 > such that for p < pM + A2, 



a2 



<\Rp\ and for 



P>Pm- Ai, \Rp\ > 

the phase transition around p 



PrJR-^) 



0-2 



. Lemma 

PM- 



below shows the "continuity" of 



Lemma 3.3. Assume $z \in U_{z_0}(\epsilon, \phi)$ and $p_0 = p_M - \lambda_1$, then for arbitrary but
fixed $\delta < \lambda_2$, we have uniformly for $z$ and for $0 < k < \lambda_1 + \delta$ that,



'po+fe 



Pr{R-1) 

02 



ffl(fl-i) 



R 



PO 



+ 1 



2fc 



+ 



Pr{R - 1) 



02 



where $a_2 = -3R + 3R^2 + 3Rz^2 - z^2$.



Lemma 3.4. Assume that $z \in U_{z_0}(\epsilon, \phi)$. Then there exists
$\kappa_0 > \lambda_2$ such that for $p > p_M + \kappa_0$,



Sp+iiz) = O 



Pr{R - 1) 



«2 



exp(-ln2-2*') . 



Step 5: Transfer to coefficients: It only remains to translate the singular 
expansion of the function into an asymptotic estimate of its coefficients. 



Theorem 3.2. The expected order of a saturated secondary structure of size
$n$ is

$$E\xi_n = \log_4 n \cdot \left(1 + O\left(\frac{\log_2 n}{n}\right)\right).$$



Proof. We first analyze the expectation function $F(z) = \sum_{p \ge 1} S_p(z)$ for
$z \in U_{z_0}(\epsilon, \phi)$. According to Lemma 3.2, Lemma 3.3 and Lemma 3.4 we have for



p>h 



P<PM 



7 _, Sp+i{2 

P>PM 



Y^ / No-p PM PrJR- 1) , 



P<PM 



X^'^p^^o)^ 



p>l 



2a2 
.p _ PM _ P/;(i^ - 1) 



O 



PR(ii - 1) 



+ 



0.2 

i^R(i2 - 1) 



a2 



Yl -5*^+1(2:)+ Y ^p+^^^) 

PM<P<PM+k-0 P>Pa/+K0 



o 



o 



Pr{R - 1) 



0.2 

Pr(.R - 1) 



p>pm+ko 



Pr{R - 1) 



02 



exp(-ln2-2P) 



02 



Combining the cases $p < p_M$ and $p > p_M$, we obtain



F{z) = ^ 5p+i(z)+ j; 5p+i(z) + 5i(z) 



P<PM 



P>PM 



Y Sp+i(zo) + I S{z) - — Si{zo) j - -^ 



PM Pr{R-1) 



p>0 



+0 



2a2 



Pr{R - 1) 



02 



Recall that $p_M$ is given by

PM = max Ip : \Pr{R - 1)| < 



02 

-j-Rp 



and we need to find an appropriate representation for it. For $z \in U_{z_0}(\epsilon, \phi)$,

By setting Fq = F{zo) and S{z) = S{zo) - ^^^^g^ + 



"2 



PM ~ - log 

0(2; — zo), we simplify F(z) into 



F(.) = F,-l^«'i'-"-»£^«'f'-^'+0 



-Zn 2a2 



Fo + 



log2 



J^fl(^-l) 



£12 



^o 2a2 
i^i?(i? - 1 



1-^ 



2a2 



+ 0{J1-- 

zo 



^„^\M^^,,JWi^^^oi,n 



Zn 2a2 



V 0.2 



Zo 



Fn 



Pz{zo) 



2PRn{zo)zr'~~'°^' 



z 
Zo 



zo 



r^)Hf^. 
According to Theorem 3.1 (Transfer Theorem), for $n > 1$, the expected order
of a saturated secondary structure is thus given by



^" " MS(.)-r(-i) ^°^^" '° 'i2Pnn{zo)zi^ z,P^{z,) "^"° 

$$= \log_4 n \cdot \left(1 + O\left(\frac{\log_2 n}{n}\right)\right),$$

whence the proof is complete. □

Finally we discuss the large deviation of the random variable $\xi_n$.

Theorem 3.3. Assume we choose < x < (^ — /3) log4n for arbitrary /3 > 0, 
then we have 

n\in-mn)\>x) = o{2-^). 

Proof. For $p < \log_4 n$, Lemma 3.2 in combination with the Transfer Theorem
implies

2P \ ^/ p 



nin>p) = l + 0[^\+0^^^ 



For $p > \log_4 n$, Lemma 3.3 indicates that

P(Cn >p) = Texp (- ^) ) for /?' > 0. 
Consequently the theorem follows. □